6  Data transformation

6.1 Dependencies and data

This data comes from a real study of implicit and self-reported evaluations. The implementation of the procedure produced three data files: one for the demographics data, one for the self-reported evaluations, and one for the implicit measure (the ‘Affect Misattribution Procedure’, or AMP). This script uses each of these to learn and practice functions from the readr, dplyr, and tidyr packages that are commonly used for data wrangling. In doing so, we will work through many of the steps involved in processing the data from a given experiment.

Code
library(dplyr)
library(tidyr)
library(readr)
library(janitor) # for clean_names() and round_half_up()
library(roundwork) # for round_up()
library(stringr)
library(knitr) # for kable()
library(kableExtra) # for kable_classic()

# demographics data
data_demographics_raw <- read_csv(file = "../data/raw/data_demographics_raw.csv") 

# self report measure data
data_selfreport_raw <- read_csv(file = "../data/raw/data_selfreport_raw.csv") 

# affect misattribution procedure (AMP) data
data_amp_raw <- read_csv(file = "../data/raw/data_amp_raw.csv")

# clean column names
data_demographics_clean_names <- data_demographics_raw %>%
  clean_names() 

data_selfreport_clean_names <- data_selfreport_raw %>%
  clean_names() 

data_amp_clean_names <- data_amp_raw %>%
  clean_names() 
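To see what clean_names() actually does, here is a sketch using a toy data frame with inconsistent column names (the names are invented for illustration):

```r
library(janitor)

messy <- data.frame(
  `Subject ID` = 1:3,
  `trialCode`  = c("a", "b", "c"),
  `Latency.MS` = c(250, 310, 190),
  check.names = FALSE
)

# clean_names() converts all column names to snake_case
names(clean_names(messy))
```

All three raw data files above get the same treatment, so their column names follow a single, predictable convention before any renaming happens.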

6.2 Renaming columns

Variable names in raw data are often not intuitive. An early step in any data wrangling pipeline is to rename them to something more meaningful.

Rename the self reports and AMP data too.

Code
data_demographics_renamed <- data_demographics_clean_names %>%
  rename(unique_id = subject,
         item = trialcode,
         rt_ms = latency) 

data_selfreport_renamed <- data_selfreport_clean_names %>%
  rename(unique_id = subject,
         item = trialcode,
         rt_ms = latency) 

data_amp_renamed <- data_amp_clean_names %>%
  rename(unique_id = subject,
         block_type = blockcode,
         trial_type = trialcode,
         trial_id = blocknum_and_trialnum,
         rt_ms = latency) 

6.3 Selecting columns

Not all variables are useful to you. An early step in any data wrangling pipeline is to drop the columns that you don’t need.

Select the self reports and AMP data too.

Code
data_demographics_selected_columns <- data_demographics_renamed %>%
  select(unique_id, item, response)

data_selfreport_selected_columns <- data_selfreport_renamed %>%
  select(unique_id, item, response, rt_ms)

data_amp_selected_columns <- data_amp_renamed %>%
  select(unique_id, 
         # methods variables
         block_type,
         trial_type,
         trial_id,
         # responses 
         rt_ms, 
         correct)

6.3.1 More flexible selecting

Code
dat <- data.frame(
  var_1_1 = rnorm(n = 100),
  var_1_2 = rnorm(n = 100),
  var_1_3 = rnorm(n = 100),
  var_1_4 = rnorm(n = 100),
  var_1_5 = rnorm(n = 100),
  var_2_1 = rnorm(n = 100),
  var_2_2 = rnorm(n = 100),
  var_2_3 = rnorm(n = 100),
  var_2_4 = rnorm(n = 100),
  var_2_5 = rnorm(n = 100)
)

dat |>
  select(starts_with("var_1")) 
         var_1_1      var_1_2       var_1_3      var_1_4      var_1_5
1    0.774106313 -0.794155974 -1.6375029034  0.759612478  0.971791323
2   -1.903705659  0.488561704  1.0262564252 -1.633467941  0.490470439
3    0.356947611  0.179722381 -0.7257685424  0.744602894  0.180996855
4    1.295393358 -1.056151692 -0.0611408296  0.758839565  1.585665152
5    2.384195479  0.363269226 -0.4221133173 -1.519768914 -0.880791357
6    1.988752941 -0.937638237 -0.6148904435 -1.924397075 -1.270684762
[output truncated: 100 rows of simulated data in total]
Code
dat |>
  select(ends_with("var_1")) 
data frame with 0 columns and 100 rows
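ends_with() matches the end of column names, which is why the pattern "var_1" matched nothing above. A suffix that does occur returns the expected columns (a sketch reusing the dat object created above):

```r
# both var_1_1 and var_2_1 end with "_1"
dat |>
  select(ends_with("_1")) |>
  names()
```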
Code
dat |>
  select(contains("_1_")) 
         var_1_1      var_1_2       var_1_3      var_1_4      var_1_5
1    0.774106313 -0.794155974 -1.6375029034  0.759612478  0.971791323
2   -1.903705659  0.488561704  1.0262564252 -1.633467941  0.490470439
3    0.356947611  0.179722381 -0.7257685424  0.744602894  0.180996855
4    1.295393358 -1.056151692 -0.0611408296  0.758839565  1.585665152
5    2.384195479  0.363269226 -0.4221133173 -1.519768914 -0.880791357
6    1.988752941 -0.937638237 -0.6148904435 -1.924397075 -1.270684762
[output truncated: same five columns as above, 100 rows in total]
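Other tidyselect helpers are worth knowing: matches() takes a regular expression, and num_range() constructs numbered names from a prefix (a sketch reusing the dat object from above):

```r
# columns whose names match a regular expression
dat |>
  select(matches("^var_[12]_1$")) |>
  names()

# columns named var_1_1, var_1_2, var_1_3
dat |>
  select(num_range("var_1_", 1:3)) |>
  names()
```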

6.4 Practice the pipe again

Combine the above function calls using pipes. Notice how this involves fewer objects in your environment, and therefore less potential for confusion or error.

Remember, this is how we solve coding problems: break them down into smaller tasks, get each working individually, then combine them again. When you only see the end product, it’s easy to assume the author wrote the code exactly as you see it, when in fact they often wrote much more verbose chunks and then combined them.

Rewrite the rename and select calls for the AMP and self report data too.

Code
# remove all objects in environment
rm(list = ls())


data_demographics_trimmed <-
  # read in the data
  read_csv("../data/raw/data_demographics_raw.csv") %>%
  
  # convert to snake case
  clean_names() %>%
  
  # make names more intuitive
  rename(unique_id = subject,
         item = trialcode) %>%
  
  # retain only columns of interest
  select(unique_id, item, response)


data_selfreport_trimmed <- 
  read_csv("../data/raw/data_selfreport_raw.csv") %>%
  clean_names() %>%
  rename(unique_id = subject,
         item = trialcode) %>%
  select(unique_id, item, response)

data_amp_trimmed <- 
  read_csv("../data/raw/data_amp_raw.csv") %>%
  clean_names() %>%
  rename(unique_id = subject,
         block_type = blockcode,
         trial_type = trialcode,
         trial_id = blocknum_and_trialnum,
         rt_ms = latency) %>%
  select(unique_id, 
         # methods variables
         block_type,
         trial_type,
         trial_id,
         # responses 
         rt_ms, 
         correct)

6.5 Counting frequencies

After renaming and selecting columns, we know what columns we have. But what rows do we have in each of them? What might we need to exclude, change, or work with in some way later on? It is very useful to use count() to obtain the frequency of each unique value of a given column.

Code
data_demographics_trimmed %>%
  count(item)
# A tibble: 2 × 2
  item       n
  <chr>  <int>
1 age      100
2 gender   100
Code
data_demographics_trimmed %>%
  count(response)
# A tibble: 50 × 2
   response     n
   <chr>    <int>
 1 18           1
 2 19           4
 3 20           1
 4 21           6
 5 22           2
 6 23           6
 7 24           1
 8 25           3
 9 26           4
10 27           5
# ℹ 40 more rows
Code
data_selfreport_trimmed %>%
  count(item)
# A tibble: 4 × 2
  item             n
  <chr>        <int>
1 instructions    99
2 like            99
3 positive        97
4 prefer          97
Code
data_selfreport_trimmed %>%
  count(response)
# A tibble: 9 × 2
  response     n
  <chr>    <int>
1 1          193
2 2           43
3 3           26
4 4           16
5 5            7
6 57          99
7 6            4
8 7            3
9 Ctrl+'B'     1
Code
data_amp_trimmed %>%
  count(trial_type)
# A tibble: 5 × 2
  trial_type                  n
  <chr>                   <int>
1 instructions                2
2 prime_negative           3604
3 prime_negative_practice   508
4 prime_positive           3604
5 prime_positive_practice   506
Code
data_amp_trimmed %>%
  count(block_type)
# A tibble: 2 × 2
  block_type     n
  <chr>      <int>
1 practice    1014
2 test        7210
Code
data_amp_trimmed %>%
  count(correct)
# A tibble: 2 × 2
  correct     n
    <dbl> <int>
1       0  3440
2       1  4784
Code
data_amp_trimmed %>%
  count(rt_ms)
# A tibble: 2,165 × 2
   rt_ms     n
   <dbl> <int>
 1     1     2
 2     3     1
 3     5     2
 4     8     1
 5     9     1
 6    11     1
 7    13     1
 8    14     1
 9    16     1
10    18     1
# ℹ 2,155 more rows

6.5.1 Frequencies of sets of columns

Note that it is also possible to use count() to obtain the frequencies of unique combinations of values across multiple columns, e.g., unique combinations of item and response.

Code
data_demographics_trimmed %>%
  count(item)
# A tibble: 2 × 2
  item       n
  <chr>  <int>
1 age      100
2 gender   100
Code
data_demographics_trimmed %>%
  count(response)
# A tibble: 50 × 2
   response     n
   <chr>    <int>
 1 18           1
 2 19           4
 3 20           1
 4 21           6
 5 22           2
 6 23           6
 7 24           1
 8 25           3
 9 26           4
10 27           5
# ℹ 40 more rows
Code
data_demographics_trimmed %>%
  count(item, response)
# A tibble: 51 × 3
   item  response     n
   <chr> <chr>    <int>
 1 age   18           1
 2 age   19           4
 3 age   20           1
 4 age   21           6
 5 age   22           2
 6 age   23           5
 7 age   24           1
 8 age   25           3
 9 age   26           4
10 age   27           5
# ℹ 41 more rows

It can be useful to arrange the output by the frequencies.

Code
data_demographics_trimmed %>%
  count(item, response) %>%
  arrange(desc(n)) # arrange in descending order
# A tibble: 51 × 3
   item   response     n
   <chr>  <chr>    <int>
 1 gender Male        36
 2 gender female      27
 3 gender male        18
 4 gender Female      11
 5 age    21           6
 6 age    23           5
 7 age    27           5
 8 age    32           5
 9 age    19           4
10 age    26           4
# ℹ 41 more rows
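count() can also sort for you directly via its sort argument, which is equivalent to piping into arrange(desc(n)):

```r
data_demographics_trimmed %>%
  count(item, response, sort = TRUE)
```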

6.6 Filtering rows

Once we know the contents of our columns, we may wish to exclude some rows using filter().

You can specify the logical test for filtering in many ways, including equivalence (==), negation (!=), or membership (%in%). It is often better to define what you do want (equivalence or membership) rather than what you do not want (negation), because negations are less robust to new data containing weird values you didn’t anticipate when you wrote the code. E.g., you could specify gender != "non-binary", but this would not catch "non binary" (without the hyphen). If, for example, you wanted to include only men and women, use gender %in% c("man", "woman") instead.*

*[This is just an example; there is usually no good a priori reason to exclude gender diverse participants]

Code
# example using equivalence
example_equivalence <- data_amp_trimmed %>%
  filter(block_type == "test")

# example using negation
example_negation <- data_selfreport_trimmed %>%
  filter(item != "instructions")

# example using membership
example_membership <- data_selfreport_trimmed %>%
  filter(item %in% c("positive", "prefer", "like"))
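If you genuinely need to exclude a set of values, it is usually safer to negate a whole membership test than to chain multiple != comparisons (a sketch using the objects above):

```r
# keep rows whose item is NOT in the excluded set
example_negated_membership <- data_selfreport_trimmed %>%
  filter(!(item %in% c("instructions")))
```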

6.6.1 Multiple criteria, ‘and’ or ‘or’ combinations

You can also have multiple criteria in your filter() call, all of which have to be met (x & y), or any one of which has to be met (x | y).

Code
example_multiple_criteria_1 <- data_amp_trimmed %>%
  filter(block_type != "test" & correct == 1)

example_multiple_criteria_2 <- data_amp_trimmed %>%
  filter(block_type != "test" | correct == 1)

# note that these provide different results - make sure you understand why
identical(example_multiple_criteria_1, example_multiple_criteria_2)
[1] FALSE
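The difference between & and | is easiest to see on a toy data frame (invented values, for illustration only):

```r
library(dplyr)

toy <- data.frame(block_type = c("test", "test", "practice", "practice"),
                  correct    = c(1, 0, 1, 0))

# '&': BOTH criteria must be met - only the practice row answered correctly survives
filter(toy, block_type != "test" & correct == 1)

# '|': EITHER criterion may be met - three of the four rows survive
filter(toy, block_type != "test" | correct == 1)
```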

6.6.2 Practice filtering

Filter the self reports data frame to remove the instructions. Filter the AMP data frame to remove the practice blocks and the instruction trials.

Code
data_selfreport_trials <- data_selfreport_trimmed %>%
  #filter(item != "instructions")
  filter(item %in% c("positive", "prefer", "like"))

# this probably contains things we don't want
data_amp_trimmed %>%
  count(trial_type, block_type)
# A tibble: 5 × 3
  trial_type              block_type     n
  <chr>                   <chr>      <int>
1 instructions            test           2
2 prime_negative          test        3604
3 prime_negative_practice practice     508
4 prime_positive          test        3604
5 prime_positive_practice practice     506
Code
# we exclude them
data_amp_test_trials <- data_amp_trimmed %>%
  filter(block_type == "test") %>%
  filter(trial_type != "instructions")

# check they are excluded
data_amp_test_trials %>%
  count(trial_type, block_type)
# A tibble: 2 × 3
  trial_type     block_type     n
  <chr>          <chr>      <int>
1 prime_negative test        3604
2 prime_positive test        3604

6.6.3 More flexible filtering

Return rows whose contents exactly match a given value

Code
data_amp_test_trials |>
  filter(trial_id == "A") # no trial_id is exactly "A", so zero rows are returned
# A tibble: 0 × 6
# ℹ 6 variables: unique_id <dbl>, block_type <chr>, trial_type <chr>,
#   trial_id <chr>, rt_ms <dbl>, correct <dbl>

Return rows that contain a given substring, rather than matching it exactly

Code
library(stringr)

test <- c("A", "AB", "B")

test == "A"
[1]  TRUE FALSE FALSE
Code
str_detect(test, "A")
[1]  TRUE  TRUE FALSE
Code
str_detect(test, "B")
[1] FALSE  TRUE  TRUE
Code
data_amp_test_trials |>
  filter(str_detect(trial_id, "2_")) 
# A tibble: 7,208 × 6
   unique_id block_type trial_type     trial_id rt_ms correct
       <dbl> <chr>      <chr>          <chr>    <dbl>   <dbl>
 1 504546409 test       prime_positive 2_2        161       0
 2 504546409 test       prime_positive 2_3        328       0
 3 504546409 test       prime_positive 2_4        220       1
 4 504546409 test       prime_negative 2_5        308       1
 5 504546409 test       prime_negative 2_6        235       1
 6 504546409 test       prime_negative 2_7        224       1
 7 504546409 test       prime_negative 2_8        369       0
 8 504546409 test       prime_positive 2_9       1105       1
 9 994692692 test       prime_positive 2_2       1611       0
10 994692692 test       prime_negative 2_3        627       0
# ℹ 7,198 more rows
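str_detect() accepts full regular expressions, so the pattern can be anchored when a plain "contains" test is too loose (a sketch; the strings are illustrative):

```r
library(stringr)

ids <- c("2_1", "12_1", "3_2")

# "2_" matches anywhere in the string
str_detect(ids, "2_")

# "^2_" matches only strings that START with "2_"
str_detect(ids, "^2_")
```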

6.6.3.1 Multiple logical tests

Code
# "|" = OR
# "&" = AND

data_amp_test_trials |>
  filter(str_detect(trial_id, "2_") &
           str_detect(trial_id, "3_"))
# A tibble: 0 × 6
# ℹ 6 variables: unique_id <dbl>, block_type <chr>, trial_type <chr>,
#   trial_id <chr>, rt_ms <dbl>, correct <dbl>
Code
data_amp_test_trials |>
  mutate(rt_ms = ifelse(str_detect(trial_id, "2_"), rt_ms+100, rt_ms))
# A tibble: 7,208 × 6
   unique_id block_type trial_type     trial_id rt_ms correct
       <dbl> <chr>      <chr>          <chr>    <dbl>   <dbl>
 1 504546409 test       prime_positive 2_2        261       0
 2 504546409 test       prime_positive 2_3        428       0
 3 504546409 test       prime_positive 2_4        320       1
 4 504546409 test       prime_negative 2_5        408       1
 5 504546409 test       prime_negative 2_6        335       1
 6 504546409 test       prime_negative 2_7        324       1
 7 504546409 test       prime_negative 2_8        469       0
 8 504546409 test       prime_positive 2_9       1205       1
 9 994692692 test       prime_positive 2_2       1711       0
10 994692692 test       prime_negative 2_3        727       0
# ℹ 7,198 more rows

6.7 Check your learning

What is the difference between select and filter?

Which is for rows and which is for columns?

6.8 Mutating: creating new columns or changing the contents of existing ones

6.8.1 Understanding mutate()

mutate() is used to create new columns or to change the contents of existing ones.

Code
# mutating new variables
example_1 <- data_amp_test_trials %>%
  mutate(latency_plus_1 = rt_ms + 1)

example_2 <- data_amp_test_trials %>%
  mutate(log_latency = log(rt_ms))

# mutating the contents of existing variables
example_3 <- data_amp_test_trials %>%
  mutate(rt_s = rt_ms / 1000) # latency is now in seconds rather than milliseconds

The operations inside mutate() can range from very simple, like the above, to much more complex. The example below uses other functions we haven’t learned yet. For now, just notice that there can be multiple mutate() calls and that together they produce a cleaned-up gender variable.

Code
# illustrate the problem with the gender responses:
data_demographics_trimmed %>%
  # filter only the gender item, not age
  filter(item == "gender") %>%
  count(response) %>%
  arrange(desc(n))
# A tibble: 11 × 2
   response       n
   <chr>      <int>
 1 Male          36
 2 female        27
 3 male          18
 4 Female        11
 5 Non-Binary     2
 6 23             1
 7 FEMALE         1
 8 MALE           1
 9 Woman          1
10 non binary     1
11 yes            1
Code
# clean up the gender variable
data_demographics_gender_tidy_1 <- data_demographics_trimmed %>%
  # filter only the gender item, not age
  filter(item == "gender") %>%
  # change the name of the response variable to what it now represents: gender
  rename(gender = response) %>%
  # change or remove weird responses to the gender question
  mutate(gender = str_to_lower(gender)) %>%
  mutate(gender = str_remove_all(gender, "[\\d.]")) %>% # remove digits and periods
  mutate(gender = na_if(gender, "")) %>% 
  mutate(gender = case_when(gender == "woman" ~ "female",
                            gender == "man" ~ "male",
                            gender == "girl" ~ "female",
                            gender == "yes" ~ NA_character_,
                            gender == "dude" ~ "male",
                            gender == "non binary" ~ "non-binary",
                            TRUE ~ gender)) %>%
  # select only the columns of interest
  select(unique_id, gender)

# illustrate the data after cleaning:
data_demographics_gender_tidy_1 %>%
  count(gender) %>%
  arrange(desc(n))
# A tibble: 4 × 2
  gender         n
  <chr>      <int>
1 male          55
2 female        40
3 non-binary     3
4 <NA>           2

A single mutate() call can contain multiple mutations. The code from the last chunk could be written more simply like this:

Code
# clean up the gender variable
data_demographics_gender_tidy_2 <- data_demographics_trimmed %>%
  # filter only the gender item, not age
  filter(item == "gender") %>%
  # change the name of the response variable to what it now represents: gender
  rename(gender = response) %>%
  # change or remove weird responses to the gender question
  mutate(gender = str_to_lower(gender),
         gender = str_remove_all(gender, "[\\d.]"), # remove digits and periods
         gender = na_if(gender, ""), 
         gender = case_when(gender == "woman" ~ "female",
                            gender == "man" ~ "male",
                            gender == "girl" ~ "female",
                            gender == "yes" ~ NA_character_,
                            gender == "dude" ~ "male",
                            gender == "non binary" ~ "non-binary",
                            TRUE ~ gender)) %>%
  # select only the columns of interest
  select(unique_id, gender)

# check they are identical
identical(data_demographics_gender_tidy_1, data_demographics_gender_tidy_2)
[1] TRUE

6.8.2 Practice mutate()

When analyzing cognitive behavioral tasks, it is common to employ mastery criteria to exclude participants who have not met or maintained some criterion within the task. We’ll do the actual exclusions later on, but for now, practice using mutate() by creating a new fast_trial column that indicates trials where the response was implausibly fast (e.g., < 100 ms).

Try doing this with a simple logical test of whether rt_ms < 100. You can do this with or without using the ifelse() function.

Code
data_amp_test_trials_with_fast_trials <- data_amp_test_trials %>%
  mutate(fast_trial = ifelse(test = rt_ms < 100,
                             yes = TRUE,
                             no = FALSE))

# more briefly but less explicitly
data_amp_test_trials_with_fast_trials <- data_amp_test_trials %>%
  mutate(fast_trial = rt_ms < 100)

6.8.3 Practice mutate() & learn ifelse()

Use mutate() to remove weird values that aren’t numbers from data_demographics_trimmed$response, for the rows referring to age.

What function could you use to first determine what values are present in this column, to know which could be retained or changed?

In simple cases like this, you can use mutate() and ifelse() to change impossible values to NA.

Code
# what values are present?
data_demographics_trimmed %>%
  filter(item == "age") %>%
  count(response) 
# A tibble: 40 × 2
   response     n
   <chr>    <int>
 1 18           1
 2 19           4
 3 20           1
 4 21           6
 5 22           2
 6 23           5
 7 24           1
 8 25           3
 9 26           4
10 27           5
# ℹ 30 more rows
Code
# fix them with mutate
data_demographics_age_tidy <- data_demographics_trimmed %>%
  filter(item == "age") %>%
  mutate(response = ifelse(test = response == "old",
                           yes = NA_character_, # response is still a character vector here
                           no = response)) %>%
  mutate(response = as.numeric(response)) %>%
  rename(age = response)

# check this has fixed the issue
data_demographics_age_tidy %>%
  count(age)
# A tibble: 40 × 2
     age     n
   <dbl> <int>
 1    18     1
 2    19     4
 3    20     1
 4    21     6
 5    22     2
 6    23     5
 7    24     1
 8    25     3
 9    26     4
10    27     5
# ℹ 30 more rows

6.8.4 Practice mutate() & ifelse()

Use mutate() to remove weird values from data_selfreport_trials$response that aren’t Likert responses.

First determine what values are present in this column.

Use ifelse() and %in% inside mutate() to change values other than the Likert responses to NA.

If you struggle to do this: practice writing ‘pseudocode’ here. That is, without knowing the right code, explain in precise logic what you want the computer to do. This can be converted to R more easily.

Code
# what values are present?
data_selfreport_trials %>%
  count(response)
# A tibble: 8 × 2
  response     n
  <chr>    <int>
1 1          193
2 2           43
3 3           26
4 4           16
5 5            7
6 6            4
7 7            3
8 Ctrl+'B'     1
Code
# what type of data is the response column?
class(data_selfreport_trials$response)
[1] "character"
Code
# remove non-Likert values: keep only responses "1" through "7"
data_selfreport_tidy <- data_selfreport_trials %>%
  mutate(response = ifelse(response %in% as.character(1:7), response, NA_character_),
         response = as.numeric(response))


# show the data after changes
data_selfreport_tidy %>%
  count(response)
# A tibble: 8 × 2
  response     n
     <dbl> <int>
1        1   193
2        2    43
3        3    26
4        4    16
5        5     7
6        6     4
7        7     3
8       NA     1
Code
class(data_selfreport_tidy$response)
[1] "numeric"

What other ways are there of implementing this mutate, e.g., without using %in%? What are the pros and cons of each?

Code
# write examples here
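
For example, here are two alternatives (sketches, assuming the same data_selfreport_trials data frame). Testing with == only catches the one bad value you happen to know about, while coercing with as.numeric() turns any non-number into NA, known or not.

```r
# option 1: test for the single known bad value with ==
data_selfreport_trials %>%
  mutate(response = ifelse(response == "Ctrl+'B'", NA_character_, response),
         response = as.numeric(response))

# option 2: coerce directly to numeric; any non-number becomes NA
# (suppressWarnings() hides the "NAs introduced by coercion" warning)
data_selfreport_trials %>%
  mutate(response = suppressWarnings(as.numeric(response)))
```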

6.8.5 Practice mutate() & learn case_when()

case_when() allows you to compare multiple logical tests or if-else tests.

The AMP data needs to be reverse scored. Just like a negatively worded item on a self-report scale (e.g., ‘I am a bad person’ among mostly positively worded items like ‘I am a good person’), the negative prime trials have the opposite ‘accuracy’ values to what they should. Use mutate() and case_when() to reverse score the negative prime trials, so that what was 0 is now 1 and what was 1 is now 0.

Code
# in your own time later, see if you can rewrite this yourself without looking at the answer to practice using case_when
data_amp_tidy <- data_amp_test_trials_with_fast_trials %>%
  mutate(correct = case_when(trial_type == "prime_positive" ~ correct,
                             trial_type == "prime_negative" & correct == 0 ~ 1,
                             trial_type == "prime_negative" & correct == 1 ~ 0))

# you can also specify a default value to return if none of the logical tests are passed with 'TRUE ~':
data_amp_tidy <- data_amp_test_trials_with_fast_trials %>%
  mutate(correct = case_when(trial_type == "prime_negative" & correct == 0 ~ 1,
                             trial_type == "prime_negative" & correct == 1 ~ 0,
                             TRUE ~ correct))
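
Because correct is coded 0/1, reverse scoring can also be done arithmetically: subtracting from 1 flips 0 to 1 and 1 to 0. A sketch using the same data frame as above:

```r
# arithmetic alternative: for 0/1 data, 1 - correct flips the coding
data_amp_tidy <- data_amp_test_trials_with_fast_trials %>%
  mutate(correct = ifelse(trial_type == "prime_negative", 1 - correct, correct))
```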

6.9 Summarizing across rows

It is very common to need summaries across rows, for example the mean and standard deviation of a column like age. This can be done with summarize(). Remember: mutate() creates new columns or modifies existing ones without changing the number of rows, whereas summarize() reduces a data frame down to one row.

Code
# mean
data_demographics_age_tidy %>%
  summarize(mean_age = mean(age, na.rm = TRUE))
# A tibble: 1 × 1
  mean_age
     <dbl>
1     35.7
Code
# SD
data_demographics_age_tidy %>%
  summarize(sd_age = sd(age, na.rm = TRUE))
# A tibble: 1 × 1
  sd_age
   <dbl>
1   12.4
Code
# mean and SD with rounding, illustrating how multiple summarizes can be done in one function call
data_demographics_age_tidy %>%
  summarize(mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE)) %>%
  mutate(mean_age = round_half_up(mean_age, digits = 2),
         sd_age = round_half_up(sd_age, digits = 2))
# A tibble: 1 × 2
  mean_age sd_age
     <dbl>  <dbl>
1     35.7   12.4

6.9.1 group_by()

Often, we don’t want to reduce a data frame down to a single row by summarizing the whole dataset; instead, we want a summary for each (sub)group. For example, the mean age within each gender group:

Code
# # this code creates data needed for this example - you can simply load the data from disk and skip over this commented-out code. we will come back to things like 'joins' later
# data_demographics_unique_participant_codes <- data_demographics_trimmed %>%
#   count(unique_id) %>%
#   filter(n == 2)
# 
# data_demographics_age_gender_tidy <- data_demographics_trimmed %>%
#   semi_join(data_demographics_unique_participant_codes, by = "unique_id") %>%
#   pivot_wider(names_from = "item",
#               values_from = "response") %>%
#   mutate(age = ifelse(age == "old", NA, age),
#          age = as.numeric(age),
#          gender = tolower(gender),
#          gender = stringr::str_remove_all(gender, regex("\\W+")), # regex is both very useful and awful to write
#          gender = case_when(gender == "female" ~ gender,
#                             gender == "male" ~ gender,
#                             gender == "nonbinary" ~ gender,
#                             gender == "woman" ~ "female",
#                             gender == "man" ~ "male"))
# 
# dir.create("../data/processed")
# write_csv(data_demographics_age_gender_tidy, "../data/processed/data_demographics_age_gender_tidy.csv")

# load suitable example data from disk
data_demographics_age_gender_tidy <-
  read_csv("../data/processed/data_demographics_age_gender_tidy.csv")


# illustrate use of group_by() and summarize()
data_demographics_age_gender_tidy %>%
  summarize(mean_age = mean(age, na.rm = TRUE))
# A tibble: 1 × 1
  mean_age
     <dbl>
1     35.9
Code
data_demographics_age_gender_tidy %>%
  group_by(gender) %>%
  summarize(mean_age = mean(age, na.rm = TRUE))
# A tibble: 4 × 2
  gender    mean_age
  <chr>        <dbl>
1 female        35.3
2 male          37.3
3 nonbinary     24.3
4 <NA>          23  
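
One gotcha worth knowing: the grouping set by group_by() persists and affects all subsequent operations until it is removed with ungroup(). A sketch using the same data:

```r
# group_by() + mutate() adds each group's mean to every row without
# collapsing the data frame; ungroup() removes the grouping afterwards
data_demographics_age_gender_tidy %>%
  group_by(gender) %>%
  mutate(mean_age_by_gender = mean(age, na.rm = TRUE)) %>%
  ungroup()
```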

6.9.2 n()

n() calculates the number of rows, i.e., the N. It can be useful inside summarize().

Code
# summarize n
data_demographics_age_gender_tidy %>%
  summarize(n_age = n())
# A tibble: 1 × 1
  n_age
  <int>
1    98
Code
# summarize n per gender group
data_demographics_age_gender_tidy %>%
  group_by(gender) %>%
  summarize(n_age = n())
# A tibble: 4 × 2
  gender    n_age
  <chr>     <int>
1 female       40
2 male         53
3 nonbinary     3
4 <NA>          2

Note that count() is just the combination of group_by(), summarize(), and n()! They produce the same results as above.

Code
# summarize n
data_demographics_age_gender_tidy %>%
  count()
# A tibble: 1 × 1
      n
  <int>
1    98
Code
# summarize n per gender group
data_demographics_age_gender_tidy %>%
  count(gender)
# A tibble: 4 × 2
  gender        n
  <chr>     <int>
1 female       40
2 male         53
3 nonbinary     3
4 <NA>          2

6.9.3 More complex summarizations

Like mutate(), the operation used to summarize can also be more complex, such as taking the mean of a logical test to calculate a proportion. For example, the proportion of participants who are less than 25 years old:

Code
data_demographics_age_tidy %>%
  summarize(proportion_less_than_25 = mean(age < 25, na.rm = TRUE)) %>%
  mutate(percent_less_than_25 = round_half_up(proportion_less_than_25 * 100, 1))
# A tibble: 1 × 2
  proportion_less_than_25 percent_less_than_25
                    <dbl>                <dbl>
1                   0.202                 20.2

You can also summarize (or indeed mutate) multiple columns in the same way using across(), i.e., ‘do this across columns’. We won’t cover all of its possible variations here; just know that it can be done. For example:

Code
# using the mtcars dataset that is built in to R, ... 
mtcars %>%
  # ... calculate the mean of every numeric column in the dataset ...
  summarize(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) %>%
  # ... and then round every column to one decimal place
  mutate(across(everything(), \(x) round_half_up(x, digits = 1)))
   mpg cyl  disp    hp drat  wt qsec  vs  am gear carb
1 20.1 6.2 230.7 146.7  3.6 3.2 17.8 0.4 0.4  3.7  2.8

6.9.4 Realize that count() is just a wrapper function for summarize()

Code
dat <- data.frame(x = c(
  rnorm(n = 50),
  rep(NA_integer_, 10)
))

dat |>
  mutate(x_is_na = is.na(x)) |>
  count(x_is_na)
  x_is_na  n
1   FALSE 50
2    TRUE 10
Code
dat |>
  summarise(n_na = sum(is.na(x)))
  n_na
1   10

6.9.5 Practice using summarize()

Calculate the min, max, mean, and SD of all responses on the self report data.

Code
data_selfreport_tidy %>%
  summarize(mean = mean(response, na.rm = TRUE),
            sd = sd(response, na.rm = TRUE),
            min = min(response, na.rm = TRUE),
            max = max(response, na.rm = TRUE))
# A tibble: 1 × 4
   mean    sd   min   max
  <dbl> <dbl> <dbl> <dbl>
1  1.72  1.26     1     7

Currently each participant has up to three responses on the self-report scales (a three-item scale: like, positive, and prefer). Create a new data frame containing each unique_id’s mean score across the items. Also calculate how many items each participant has data for, and whether they have complete data (i.e., data for all three items).

Code
data_selfreport_scored <- data_selfreport_tidy %>%
  group_by(unique_id) %>%
  summarize(mean_self_report = mean(response),
            n_self_report_items = n()) %>%
  mutate(self_report_complete = n_self_report_items == 3)


# note: mean() returns NA if any value is missing, e.g., mean(c(3, 5, 7, NA))
# returns NA. We don't use na.rm = TRUE above, so a participant with a
# missing response gets a missing mean score; this is an analytic choice.

Using only participants with complete data, calculate the mean and SD of all participants’ mean scores on the self-reports.

Code
# data_selfreport_scored %>%
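
One possible solution (a sketch; it assumes the data_selfreport_scored data frame created above):

```r
# restrict to participants with complete data, then summarize their mean scores
data_selfreport_scored %>%
  filter(self_report_complete) %>%
  summarize(mean_mean_self_report = mean(mean_self_report, na.rm = TRUE),
            sd_mean_self_report = sd(mean_self_report, na.rm = TRUE))
```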

Create a new data frame that calculates the proportion of prime-congruent trials for each participant on the AMP (i.e., the mean of the ‘correct’ column), their proportion of too-fast trials, and their number of trials.

Also add to that data frame a new column called “exclude_amp” and set it to “exclude” if more than 10% of a participant’s trials are too-fast trials and “include” if not.

Code
# data_amp_scored <- data_amp_tidy %>%
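
One possible solution (a sketch; it assumes the data_amp_tidy data frame created earlier, with its correct and fast_trial columns):

```r
# score the AMP per participant: proportion of prime-congruent trials,
# proportion of too-fast trials, and number of trials
data_amp_scored <- data_amp_tidy %>%
  group_by(unique_id) %>%
  summarize(proportion_congruent = mean(correct, na.rm = TRUE),
            proportion_fast_trials = mean(fast_trial, na.rm = TRUE),
            n_trials = n()) %>%
  # flag participants with more than 10% too-fast trials for exclusion
  mutate(exclude_amp = ifelse(proportion_fast_trials > 0.10, "exclude", "include"))
```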

Calculate the proportion of participants who are to be excluded.

Code
# data_amp_scored %>%
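
One possible solution (a sketch; it assumes the data_amp_scored data frame from the previous exercise):

```r
# the mean of a logical test gives the proportion of TRUEs
data_amp_scored %>%
  summarize(proportion_excluded = mean(exclude_amp == "exclude"))
```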

6.10 Check your learning

What is the difference between mutate() and summarize()? If I use the wrong one, will I get the same answer? E.g., mutate(mean_age = mean(age, na.rm = TRUE)) vs. summarize(mean_age = mean(age, na.rm = TRUE))
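
One way to check your answer: run both on a small made-up data frame and compare the shapes of the results.

```r
# mutate() keeps all rows and repeats the mean on each one;
# summarize() collapses the data frame to a single row
df <- data.frame(age = c(20, 30, 40))

df %>% mutate(mean_age = mean(age, na.rm = TRUE))     # 3 rows
df %>% summarize(mean_age = mean(age, na.rm = TRUE))  # 1 row
```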

6.11 Writing data to disk

Code
# write_csv(data_processed, "../data/processed/data_processed.csv")
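
For example, to save one of the processed data frames created above (a sketch; the object and file names are illustrative):

```r
# create the output directory if it doesn't already exist
dir.create("../data/processed", showWarnings = FALSE)

# write the tidied demographics data to disk as a .csv file
write_csv(data_demographics_age_tidy, "../data/processed/data_demographics_age_tidy.csv")
```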